
    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group (comprised of cache sets) granularity and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5%, and by as much as 28.5%, for the benchmark programs we examined. Furthermore, our evaluation shows that CE outperforms related CMP cache designs.
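
    To make the placement idea concrete, the following Python sketch shows a minimum-pressure placement loop. The group count, counter handling, and decay policy are assumptions for illustration, not the paper's exact hardware bookkeeping:

        # Hypothetical sketch of pressure-aware block placement; parameters
        # and aging policy are assumed, not taken from the paper.
        NUM_GROUPS = 64                      # cache sets are clustered into groups

        pressure = [0] * NUM_GROUPS          # temporal pressure per group

        def record_access(group_id):
            # continuously collect pressure as blocks in a group are touched
            pressure[group_id] += 1

        def place_block():
            # place an incoming block at the group exhibiting minimum pressure
            return min(range(NUM_GROUPS), key=lambda g: pressure[g])

        def end_of_epoch(decay=2):
            # periodically age the counters so placement tracks recent behavior
            for g in range(NUM_GROUPS):
                pressure[g] //= decay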

    Predicting coherence communication by tracking synchronization points at run time

    Predicting the target processors that a coherence request must be delivered to can improve the miss-handling latency in shared-memory systems. In directory coherence protocols, directly communicating with the predicted processors avoids costly indirection through the directory. In snooping protocols, prediction relaxes the high bandwidth requirements by replacing broadcast with multicast. In this work, we propose a new run-time coherence target prediction scheme that exploits the inherent correlation between synchronization points in a program and coherence communication. Our workload-driven analysis shows that by exposing synchronization points to hardware and tracking them at run time, we can simply and effectively track stable and repetitive communication patterns. Based on this observation, we build a predictor that improves the miss latency of a directory protocol by 13%. Compared with existing address- and instruction-based prediction techniques, our predictor achieves comparable performance with substantially smaller power and storage overheads.
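
    A minimal sketch of the idea, assuming hypothetical table names and that hardware is notified when the program crosses a synchronization point (the paper's predictor tables may be organized differently):

        from collections import defaultdict

        # communication pattern learned per synchronization point
        sharers_after_sync = defaultdict(set)
        current_sync = None

        def on_sync_point(sync_id):
            # the program exposes its synchronization points to hardware
            global current_sync
            current_sync = sync_id

        def predict_targets(requesting_core):
            # multicast to cores seen communicating after this sync point in
            # the past; an empty prediction falls back to directory/broadcast
            predicted = set(sharers_after_sync[current_sync])
            sharers_after_sync[current_sync].add(requesting_core)
            return predicted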

    C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

    This paper describes Constrained Associative-Mapping-of-Tracking-Entries (C-AMTE), a scalable mechanism to facilitate flexible and efficient distributed cache management in large-scale chip multiprocessors (CMPs). C-AMTE enables fast location of cache blocks in CMP cache schemes that employ one-to-one or one-to-many associative mappings. C-AMTE stores tracking entries in per-core data structures to avoid bursts of on-chip interconnect traffic and long-distance directory lookups. Simulation results using a full-system simulator demonstrate that C-AMTE improves cache access latency by up to 34.4%, close to that of a perfect location strategy.
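
    The following sketch illustrates the flavor of a per-core tracking structure; the capacity and the simple FIFO eviction are illustrative assumptions, not C-AMTE's constrained-associative details:

        TRACKING_ENTRIES = 16                # assumed per-core capacity

        class TrackingTable:
            # per-core table mapping a block tag to its current home tile
            def __init__(self):
                self.entries = {}

            def lookup(self, tag):
                # a hit locates the block directly, avoiding a long-distance
                # directory lookup across the interconnect
                return self.entries.get(tag)

            def insert(self, tag, home_tile):
                if len(self.entries) >= TRACKING_ENTRIES:
                    self.entries.pop(next(iter(self.entries)))  # FIFO eviction
                self.entries[tag] = home_tile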

    Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

    This paper presents and studies a distributed L2 cache management approach based on OS-level page allocation for future many-core processors. L2 cache management is a crucial multicore processor design aspect: it must overcome non-uniform cache access latency for good program performance and reduce on-chip network traffic and the associated power consumption. Unlike previously studied hardware-based private and shared cache designs that implement a "fixed" caching policy, the proposed OS-microarchitecture approach is flexible; it can easily implement a wide spectrum of L2 caching policies without complex hardware support. Furthermore, our approach can provide differentiated execution environments to running programs by dynamically controlling data placement and cache sharing degrees. We discuss key design issues of the proposed approach and present preliminary experimental results showing its promise.
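
    As a rough sketch of the mechanism: if the target L2 slice is determined by page-frame-number bits, the OS can steer a page to a slice simply by choosing a frame from the matching free list. The slice-selection function and free-list layout below are assumptions for illustration:

        NUM_SLICES = 16

        # physical frames binned by the L2 slice their frame-number bits select
        free_frames = {s: list(range(s, 1024, NUM_SLICES)) for s in range(NUM_SLICES)}

        def slice_of(frame_number):
            # assume the slice is selected by low-order frame-number bits
            return frame_number % NUM_SLICES

        def alloc_page(preferred_slice):
            # implement a caching policy in software: pick a frame whose
            # address maps the page into the preferred slice
            if free_frames[preferred_slice]:
                return free_frames[preferred_slice].pop()
            fallback = max(free_frames, key=lambda s: len(free_frames[s]))
            return free_frames[fallback].pop()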

    Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches

    This paper presents a two-part study on managing distributed NUCA (Non-Uniform Cache Architecture) L2 caches in a future many-core processor to obtain high single-thread program performance. The first part of our study is a limit study in which we determine data-to-cache-slice mappings at the memory page granularity, based on detailed inter-page conflict information derived from the program's memory reference trace. By considering cache access latency and cache miss rate simultaneously when mapping data to L2 cache slices, this "oracle" scheme outperforms the conventional shared caching scheme by up to 208%, with an average of 45%, on a sixteen-core processor. In the second part of the study, we propose and evaluate a dynamic cache management scheme that determines the home cache slice and cache bin for memory pages without any static program information. The dynamic scheme outperforms the shared caching scheme by up to 191%, with an average of 32%, achieving much of the performance observed in the limit study. We also find that the proposed dynamic scheme adapts well to the behavior of multiprogrammed workloads and performs significantly better than both the private caching scheme and the shared caching scheme.
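
    A sketch of the core trade-off in such a dynamic policy, with an illustrative cost function rather than the paper's tuned mechanism: each candidate slice is scored by its distance from the running core plus the extra misses its load implies, and hotter pages weight the miss term more heavily.

        def choose_home_slice(page_hotness, slice_load, distance, capacity):
            # balance access latency (distance) against miss rate (load);
            # the weighting is hypothetical
            best_slice, best_cost = None, float("inf")
            for s in range(len(slice_load)):
                cost = distance[s] + page_hotness * (slice_load[s] / capacity)
                if cost < best_cost:
                    best_slice, best_cost = s, cost
            return best_slice

        # example: a hot page avoids the nearly full nearby slice
        print(choose_home_slice(4.0, [900, 200], [1, 2], 1000))   # -> 1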

    Reducing Coherence Overhead in Shared-Bus Multiprocessors

    To reduce the overhead of cache coherence enforcement in shared-bus multiprocessors, we propose a self-invalidation technique as an extension to write-invalidate protocols in a shared-bus environment. The technique speculatively identifies cache blocks to be invalidated and dynamically determines when to invalidate them locally. We also consider enhancing our self-invalidation scheme by incorporating read snarfing to reduce the cache misses caused by incorrect prediction. We evaluate our self-invalidation scheme by simulating SPLASH-2 benchmark programs that exhibit various reference patterns, under a realistic shared-bus multiprocessor model. We observed a considerable reduction in invalidations, averaging 71.6% over the base Illinois protocol for the benchmark programs studied. We also discuss the effectiveness and hardware complexity of self-invalidation and its enhancement with read snarfing in our extended protocol.
    Keywords: cache coherence, read snarfing, self-invalidation, share..
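
    A simplified sketch of the self-invalidation idea; the prediction heuristic and the local invalidation trigger below are assumptions for illustration:

        class CacheBlock:
            def __init__(self, tag):
                self.tag = tag
                self.valid = True
                self.marked = False      # speculatively marked for self-invalidation

        def mark_on_fill(block, invalidated_last_time):
            # assumed heuristic: blocks invalidated in the past are predicted
            # to be invalidated again
            block.marked = invalidated_last_time

        def self_invalidate(cache):
            # at the chosen trigger point, drop marked blocks locally, so no
            # bus invalidation is needed when another processor later writes
            for block in cache:
                if block.marked:
                    block.valid = False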